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Abstract: We consider the problem of low rank matrix recovery in a 
stochastically noisy high dimensional setting. We propose a new estimator 
for the low rank matrix, based on the iterative hard thresholding method, 
and that is computationally efficient and simple. We prove that our esti¬ 
mator is optimal both in terms of the Frobenius risk, and in terms of the 
operator norm risk, i.e. in terms of the entry-wise risk uniformly over any 
change of orthonormal basis. This result allows us to provide the limiting 
distribution of the estimator. In the case where the design is Gaussian, we 
prove that the entry-wise bias of the limiting distribution of the estimator 
is small, which is of great interest for constructing tests and confidence sets 
for low dimensional subsets of entries of the low rank matrix. 

Keywords and phrases: low rank matrix recovery, high dimensional sta¬ 
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1. Introduction 

High-dimensional data have generated a great challenge in different fields of 
statistics, computer science, and machine learning. In order to consider cases 
where the number of covariates is larger than the sample size, new method¬ 
ologies, applicable for the model under some structural constraints, have been 
developed. For instance, there have been substantial works under the sparsity 
assumption including sparse linear regression, sparse covariance matrices esti¬ 
mation or sparse inverse covariance matrices estimation (see e.g. Meinshausen 
and Biihlmann, 2006; Bickel et al., 2009; Huang et ah, 2008; Friedman et ah, 
2008; Cai and Zhou, 2012). In this paper, we focus on the problem of low rank 
matrix recovery and uncertainty quantification. 

There have been quite a few work on estimating low rank matrices in the ma¬ 
trix regression setting (also named the trace regression setting, the matrix com¬ 
pressed sensing setting, or the quantum tomography setting when the parameter 
is a density matrix). Many authors (e.g. Candes and Recht, 2009; Candes and 
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Tao, 2010; Recht, 2011; Gross, 2011) considered the exact recovery of a low- 
rank matrix based on a subset of uniformly sampled entries. Also Recht (2011); 
Candes and Plan (2011); Flammia et al. (2012); Gross et al. (2010); Liu (2011) 
considered matrix recovery based on a small number of noisy linear measure¬ 
ments in the framework of Restricted Isometry Property (RIP). Negahban and 
Wainwright (2011) proved non-asymptotic bounds on the Frobenius risk, and 
investigated matrix completion under a row/column weighted random sampling. 
Koltchinskii et al. (2011) proposed a nuclear norm minimisation method and de¬ 
rived a general sharp oracle inequality under the condition of restricted isometry 
property. Very recently, Cai and Zhang (2015) considered a rank-one projection 
model and used constrained nuclear norm minimization method to estimate the 
matrix. Flammia et al. (2012); Gross et al. (2010) considered a specific quan¬ 
tum tomography problem where the parameter is a density matrix (for more 
details, plasase see Subsection 4.2), and Liu (2011) proved that the quantum 
tomography design setting satisfies the RIP. In addition, Koltchinskii (2011) 
proposed an estimator based on an entropy minimisation for solving a quantum 
tomography problem. 

Goldfarb and Ma (2011); Tanner and Wei (2012) adapt the iterative hard 
thresholding method (first introduced in the sparse linear regression setting, see 
e.g. Needell and Tropp, 2009; Blumensath and Davies, 2009) to the problem of 
low rank matrix recovery in the case where the noise is non-stochastic and of 
small L 2 norm. This procedure has the advantage of being very computationally 
efficient. In the same vein but applied to the more challenging stochastically 
noisy setting, Agarwal et al. (2012) introduced a soft thresholding technique 
that provides efficient result in this setting in Frobenius norm, see also Bunea 
et al. (2011); Chen and Wainwright (2015); Klopp (2015) for other thresholding 
methods in related settings that provide results also in Frobenius norm. 

Another important problem is on understanding the uncertainty associated 
to these statistical methodologies, by e.g. characterizing the limiting distribution 
of the efficient estimators. Yet results in this area for high dimensional models 
are still scarce, available mainly for the sparse (generalised) linear regression 
models (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; van de Geer 
et al., 2014; Nickl and van de Geer, 2014). In the papers (Zhang and Zhang, 
2014; Javanmard and Montanari, 2014; van de Geer et al., 2014), the authors 
focus first on constructing an estimator for the sparse parameter that has good 
properties in risk, and they use then this result to exhibit the limiting 
distribution of their estimator. Knowing this limiting distribution immediately 
enables the construction of tests and confidence sets for low dimensional subsets 
of parameters. 

A similar achievement, i.e. the construction of an estimator that has an ex¬ 
plicit limiting distribution, does not exist in the low rank matrix recovery set¬ 
ting. To the best of our knowledge, moreover, all the theoretical results from 
the above papers on the estimation of the parameter in the noisy setting are 
derived in Frobenius risk -neither in the entrywise matrix L x risk, nor in the 
operator norm (i.e. the largest singular value). 

In our paper, we consider the problem of constructing an estimator for low- 
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rank matrix in a stochastically noisy high-dimensional setting, under the as¬ 
sumption that a RIP-type isometry condition is satisfied (see Assumption 2.1). 
We provide (in Theorem 3.1) error bounds for our estimator in all p Schat- 
ten norms for p > 0. We prove in particular that this estimator has optimal 
Frobenius and operator norm risk by proving that this estimator has optimal 
Lqo risk performance uniformly over any change of orthonormal basis. In ad¬ 
dition, a slight modification of our estimator has an explicit Gaussian limiting 
distribution with bounded bias in operator norm (see Theorem 3.2); and in the 
particular case when the design consists in uncorrelated Gaussian entries, we 
prove that the bias in entry wise norm is bounded as well, which is immedi¬ 
ately useful for testing hypotheses and constructing confidence intervals for each 
parameter of interest, similar to the ideas in Zhang and Zhang (2014); Javan- 
mard and Montanari (2014); van de Geer et al. (2014). Moreover our estimator is 
computationally efficient with an explicit algorithm. The proposed algorithm is 
inspired by the iterative hard thresholding, that refines its estimation of the ma¬ 
trix by iteratively estimating the low rank sub-space where the matrix’s image 
is defined. It requires only O(logn) iteration steps to converge approximately, 
and the computational complexity of the method is of order 0(nd 2 logra) where 
d is the dimension of the matrix, and n is the sample size. 

In the experiment section we first provide some simulations where we illus¬ 
trate the efficiency of our method and explain how it can be used to create a 
confidence interval for the entries of the low rank matrix. We then apply our 
method to a specific quantum tomography application, namely multiple ion to¬ 
mography (see, e.g. Guta et al., 2012; Gross et al., 2010; Butucea et al., 2015; 
Haeffner et al., 2005; Acharya et al., 2015; Holevo, 2001; Nielsen and Chuang, 
2000) , where the assumptions required by our method are naturally satisfied (see 
e.g. Liu, 2011; Flammia et al., 2012). Finally we compare our method with other 
existing estimation methods for the trace regression setting (Candes and Tao, 
2010; Gross et al., 2010; Koltchinskii et al., 2011; Flammia et al., 2012) using 
the gradient descent implementation of Agarwal et al. (2012) and also regular¬ 
ized maximum likelihood based procedures (Butucea et al., 2015; Acharya et al., 
2015). 

As a complement in the Supplementary Material, we adapt our method to 
the setting of sparse linear regressionm and provide an estimator that has an 
explicit limiting distribution (recovering the results of Zhang and Zhang (2014); 
Javanmard and Montanari (2014); van de Geer et al. (2014)). 

2. Setting 

2.1. Preliminary notations 

For T > 0, q G N and !t £ C ? , we write [ u\t for the hard thresholded version of 
u at level T, i.e. for the vector v such that Vi = Uil{\ui\ > T} for i = 1,... ,q. 
For q > 0 and u £ R q , we write ||u ||2 = yjj2i< q \ u i\ 2 f° r the standard L 2 norm 
of u, and ||w||oo = sup i |iq| for the standard norm of u. 
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For a q x q complex matrix A , we write A T as the conjugate transpose of A. 
We write tr(A) = J2k Ak,k for the trace of A, and diag(A) for the matrix whose 
diagonal entries are the same as A while its non-diagonal entries are all zeros. We 
write the entry-wise matrix norm of A as ||^4||oo = ma Xij A (i j |. and its squared 
Frobenius norm as ||A||| = ■ A'j 7 . We write also the operator norm of A as 

||A||s = max, Aj, where the Xi are the singular values of A, and the Schatten p 

norm of A for p > 1 as ||A||s p = ^ JT K) " anci note that mils 2 = ll^lb- 
For T > 0, we write |yljr for the hard thresholded version of A at level 
T for each entry, i.e. for the matrix V such that Vij = A^jl{\Aij\ > T} for 

i,j = 

2.2. Model 

Let d, n £ N. Let M be the set of d x d matrices, and 

M(k), 

be the set of d x d complex matrices of rank less than or equal to k. Let us also 
write 

Mn, 

for the set of orthonormal matrices in A4. 

For X 1 e Ad, 0 £ Ai, we consider the matrix regression problem where for 
any i < n, 

Yi = tr((X*) T 0) + gj, 

where e £ R” is an i.i.d. vector of Gaussian white noise, i.e. e ~ Af(0,I n ) (but 
our results hold in the same way for any sub-Gaussian independent noise e: see 
Remark 3.2), and d < n but d 2 n. Let us write X for the linear operator 
going from M to R n , and such that for any A G M, 

X(A) = (tv((XYA)) . 

\ / i<n 

The model can be rewritten as 


Y = X(0)+e, 


where Y = (Li)i< n . This matrix regression model is directly related to the quan¬ 
tum tomography model (in which case the design X is often chosen to be the 
random Pauli design (Flammia et al., 2012; Gross et al., 2010; Liu, 2011; Gross, 
2011; Koltchinskii, 2011), but it is also related to e.g. matrix completion (Ne- 
gahban and Wainwright, 2011; Koltchinskii, 2011). 

We state the following assumption on the design operator X. 


Assumption 2.1. Let K < d. For any k < 2K. it holds that 


sup 

AGM(k) 


-||X(A)||i 

n 


U\\l <c n (k)\\A\\ 


2 

25 


where c n (k) > 0. 
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Remark 2.1. The above assumption is very related to the Restricted Isom¬ 
etry Property. Typically, for uncorrelated Gaussian design with mean 0 and 
variance 1 entries, it will hold with probability larger than 1 — 5 for c n (k ) < 
C\Jkd\og(l/8)/n where C > 0 is a universal constant. For the Pauli design 
used in quantum tomography, it will hold with probability larger than 1 — 5 for 
c n (k) < C-yJkd\og(d /5 )/n where C > 0 is a universal constant (Liu, 2011) - see 
Subsection f.2 for a description of description of a quantum tomography setting 
in which the Pauli matrices represent measurements. 

3. Main results 

As a generalization of sparsity constraints in linear regression models, we impose 
a rank k < d constraint on a matrix 0 G U. dxd . That is, we require the rows 
(or columns) of 0 lie in some fc-dimensional subspace of This type of rank 
constraint arises in numerous applications such as quantum tomography, ma¬ 
trix completion, and matrix compressed sensing (see e.g. Flammia et al., 2012; 
Gross et al., 2010; Liu, 2011; Gross, 2011; Negahban and Wainwright, 2011; 
Koltchinskii et al., 2011). 

3.1. Method 

Our method considers the parameters B > 0,5 > 0, K > 0. The parameter 5 is a 
small probability that will calibrate the precision of the estimate: the theoretical 
results that we will prove later for this estimate will hold with probability 1 — 5, 
and the smaller 5, the larger the constant in the bound (see Theorem 3.1). 
The parameter K is an upper bound on two times the actual low rank of the 
parameter 0. It does not need to be tight, and the final results will not depend on 
it as long \[Kc n (K ) <C 1 (see Assumption 2.1 and Theorem 3.1). The parameter 
B is an upper bound on the Frobenius norm of the parameter 0. It again does 
not need to be tight, but constants in the proof will scale with it. 

We set the initial values for the estimator 0° and the threshold To such that 

0° = 0 G R dxd , T 0 = B G M+. 

We update the thresholds 


T r = Ac n (2K)VKT r -i + v n := pT r _i + v n . 


where v n = C\Jd log ^ &S> , C is an universal constant (see Lemma 5.2) and 
p := Ac n {2K)VK. 

Set now recursively, for r G N, r > 1, 


i " 

T r = - V(A i ) T (K 1 - tr(X*0 r-1 )) G R dxd , 

i=l 
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and let U r , V r € Ai^ be two orthonormal matrices that diagonalise 0 r_1 + 4 ,r . 
Then we set 

0 r = U r [ (C/ ? ') T (0 r_1 + * r )V r ] Tr (V r ) T ■ (3-1) 

This procedure provides a sequence of estimates, and as we will prove in the 
next subsection, this sequence is with high probability close to the true 0 as 
soon as r is of order log(n) (see Theorems 3.1 and 3.2). 

Remark 3.1. Note that although we describe this method using many quantities, 
in fact while implementing our method we only need to set up four quantities: 
p,v n ,To and the stopping time r. We describe in Equation (3.3) how to imple¬ 
ment a good stopping rule, and in Subsection 3.3 how to choose the three first 
parameters. In particular, T 0 can be chosen in a data driven way. 

This method is related to Iterative Hard Thresholding (IHT), a method that 
has been developed for the sparse regression setting (see e.g. Blumensath and 
Davies, 2009; Needed and Tropp, 2009). It is less straightforward to see this in 
this setting, as in the sparse regression setting where we adapt also our method 
in Subsection A, and for a more comprehensive discussion of the relation between 
our method and IHT, see the Remark A.2. Note that IHT algorithms have been 
proved to work in settings where the noise is small and non-stochastic (see e.g. 
Blumensath and Davies, 2009; Needed and Tropp, 2009; Goldfarb and Ma, 2011; 
Tanner and Wei, 2012), but to the best of our knowledge, there are no results 
on IHT in a stochastically noisy setting. 

3.2. Results for the low rank matrix recovery 

Main result for our thresholded estimator We now provide a theorem 
that guarantees that the estimate 0 r after 0(log(n)) iterations has at most rank 
k , and its entry-wise Loo risk and Frobenius risk are bounded with the optimal 
rates (see the first point in Subsection 3.3). 

Theorem 3.1. A ssume that Assumption 2.1 is satisfied and that c n (2K)\/~K < 
1/4. Let r ss 0(log(n)). We have that for a constant C\ > 0 it holds that with 
probability larger than 1 — A and for any k < K/2 


sup 

ee7W(fc),||0|| 2 <B 


10-01s < Cx 


dlog(l/5) 


and also that 


sup rank(0 r ) < k , 

0G7W(fe),||0||2<B 


and also that for any p > 0 


sup || 0 
0e7W(fc),||0|| 2 <B 


0 r 


< cc/* J?MM. 

V n 
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The above theorem proves among other things that our estimate attains the 
minimax optimal Schatten p risk, which other estimates in the literature also 
attain for e.g. p = 2. The first interesting property is that our proposed estimator 
has an explicit algorithmic form and is very computationally efficient. Another 
interesting additional property is that it is also minimax-optimal in operator 
norm (or entry-wise matrix L x risk), and that the entry-wise error is not more 
than yjd/n with high probability for any orthonormal change of basis of the 
matrix 0. This is a strong result since the entry-wise norm is not invariant by 
orthonormal change of basis while the Frobenius norm is. This result is already 
useful for measuring the uncertainty of an estimate (in particular since it does 
not require the a priori knowledge of the rank of the matrix 0). 


Asymptotic normality results To prove asymptotic normality, we slightly 
modify the estimator defined in Theorem 3.1. Consider the estimator 0 r of 
Theorem 3.1 (with r « 0(log(n))) and define 

n 

0 = 0 r + - VPC) T K - tr((X l ) T 0 r )l. 
n 


Theorem 3.2. Set 


Z: =^r E( x *) Te * 

A := ^(0 r - 0) - -L y (A i ) T tr((X i ) T (0 r - 0)). 

v i<n 


Then we have 

y/n{Q - 0) = A + Z, (3.2) 


where Z\X ~ Af( 0, (± . 

Let r ss 0(log(n)). The two following bounds hold for the bias term A under 
two different assumptions. 




Assume that Assumption 2.1 is satisfied for some K > 0 and that c„(2 K)V K 
o(l). If the rank of 0 is smaller than 2 K and if its Frobenius norm is 
bounded by B, there is a constant C± > 0 such that with probability larger 
than 1 — <5 


II Alls 
Vd 


< 4Cic„(2A)V / Xlog(l/<5) = op(1). 


• Assume that the elements in the design matrices X -1 £ A4 are i.i.d. Gaus¬ 
sian with mean 0 and variance 1, and that max(K 2 d, Kd\og(d)) = o(n), 
we have that 


l!A||oo = Op(l). 


Note that this implies the previous result. 
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This theorem, which is in the spirit of the works in the context of sparse 
linear regression of Zhang and Zhang (2014); Javanmard and Montanari (2014); 
van de Geer et al. (2014), implies that there exists an estimator of 0 that has 
a Gaussian limiting distribution, and whose rescaled bias A with respect to 0 
can be bounded i) in operator norm under our Assumption 2.1 and ii) in 
norm as well in the specific case where the design is Gaussian. 

Remark 3.2. Theorems 3.1 and 3.2 are proved for a Gaussian noise e, but 
these results are easily generalisable to any independent, sub-Gaussian noise, 
with a similar but more technical proof (based on Talagrand’s inequality). The 
results of Theorem 3.2 would however be modified in that the random variable 
Z, conditioned on the design X, would then not be exactly Gaussian, but have a 
limiting Gaussian distribution using the central limit theorem. 

Stopping rule r Theorem 3.1 is satisfied after r = 0(log(n)) iterations of our 
thresholding strategy, and so we know what is a theoretical value for r so that 
our strategy works. However, it is possible to propose a data driven stopping 
rule that will perform well. For a desired precision e > 0, we propose to stop 
the algorithm as soon (after having thresholded a last time) as 


T r < (1 + e)—*— v, 


(3.3) 


1 ~ P 


Let us write f for the time where the stopping rule stops. The following result 
holds for the estimator stopped at this stopping rule. 

Theorem 3.3. Assume that Assumption 2.1 is satisfied and that c n (2K)\/~K < 
1/8, i.e. p <1/2. Let e < 0.1 in (3.3. The estimator 0 r satisfies with probability 
larger than 1 — <5 and for any k < K/2 



and also that 


sup rank(0 r ) < k , 
©e-M(fc),||e|| 2 <-B 


which implies for any p > 0 



Moreover f is such that 
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This empirical stopping rule, that does not require the tuning of any addi¬ 
tional parameters 1 , is guaranteeing minimax optimal results in less than log(n) 
iterations. Note that Theorem 3.2 would also hold using this stopping rule - this 
can be proved in the same way as Theorem 3.3 is proved. 

3.3. Discussion 

Comparison of our results with the literature Our Theorem 3.1 gives 
bounds for our estimators in all Schatten p > 0 norms (including the operator 
norm, and therefore uniform entry wise bounds in all rotation basis). A first 
point is that our results are minimax optimal in both Frobenius and opera¬ 
tor norm. The corresponding lower bound in Frobenius norm can be found in 
e.g. Theorem 5 of Koltchinskii et al. (2011) (under an assumption related to our 
Assumption 2.1) or Candes and Plan (2011) under an assumption that is the 
same as ours. The corresponding lower bound in operator norm can be found 
in e.g. Carpentier et al. (2015). In addition to this, the paper Koltchinskii and 
Xia (2015) even contains further lower bounds results proving that the operator 
norm rate \Jd/n (and associated Schatten q norm k x ! q yj~dfn) is optimal also 
in the case of quantum tomography that we use in our experiments later on, 
i.e. under the additional assumptions that the parameter is a density matrix 
and that the design is random Pauli. To the best of our knowledge, our method 
is the first iterative method that has such an optimality property in operator 
norm - for instance, the paper Koltchinskii and Xia (2015) provides results for 
Schatten norms with q £ [1,2], but not for other Schatten norms. Besides, we 
proved in Theorem 3.2 a slight modification of our estimator has an explicit 
Gaussian limiting distribution, and this is, again to the best of our knowledge, 
the first iterative method for low rank matrix recovery that has such a property. 
On top of that, the computational complexity of our algorithm is low as for any 
procedure based on iterative hard thresholding : see the papers (Goldfarb and 
Ma, 2011; Tanner and Wei, 2012). Our assumption 2.1 is a strong RIP condi¬ 
tion. But it is in particular satisfied in the interesting application of multiple 
ion tomography for the natural Pauli design as soon as the number of settings 
is large enough, see Subsection 4.2. 

Operator norm bounds are particularly interesting since they provide an en- 
trywise bound up to any change of orthonormal basis. In particular, they provide 
a bound on the eigen values - and since these bounds do not depend on the true 
rank k , they can be used to implement conservative confidence sets. Moreover as 
highlighted in the papers Zhang and Zhang (2014); Javanmard and Montanari 
(2014); van de Geer et al. (2014), having a bound on the entrywise risk, and 
then an estimator with explicit limiting distribution, is interesting in that it can 
be used to construct tests and confidence intervals for subsets of coordinates of 
the parameter 0. We illustrate this point in the Simulation section (see Sec¬ 
tion 4.1), where a confidence set is constructed using the limiting distribution. 

1 In Subsection 4.1 we introduced the desired precision e. The precision e is a very natural 
quantity to choose for the experimenter—one can set e.g. e = 0.1 for an 1.1 optimal solution 
with respect to the solution outputted by an algorithm that runs for an infinitely long time 
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Note however that the bound on the bias term A in L^ norm in Theorem 3.2 
requires the fact that the design is Gaussian. On the other hand, the bound on 
the bias term A in operator norm in Theorem 3.2 requires only the fact that 
our assumption 2.1 is satisfied. 

Stopping rule r Our theorems are satisfied after r = 0(log(n)) iterations of 
our thresholding strategy, and so we know what is a theoretical value for r so 
that our strategy works. We also defined an empirical stopping rule, see (3.3) 
and Theorem 3.3. We use this stopping rule in practice for all our experiments 
in Section (4). 

Calibration of the parameters of the proposed method Our method is 
not parameter free - there are three quantities that need to be calibrated. The 
two first ones, that we write p and v n . enter in the definition of the thresholds 
sequence ( T r ) r . p controls the rate at which we make our threshold decay, and 
v n /{\ — p) is the quantity toward which it converges when r goes to infinity. The 
last quantity, namely T 0 is the initialisation of the threshold sequences. Here are 
some comments on how to choose these quantities: 

• Rate of decay p : In theory the parameter p can be taken between 1 and 
A-\/~Kc n (2K) where K is an upper bound on the rank of the parameter 
and c n (2K) is the constant associated to the design such that Assump¬ 
tion 5.1 is satisfied. It might not be really possible to compute exactly K 
or c n (2I\) without more assumptions on the design. But there is at least 
one design that is interesting in practice, namely the random Pauli design 
for quantum tomography, that is such that we have an upper bound on 
c„(2 K) for all K that is of order \jKd log (d)/ n with high probability, 
i.e. it decays with n (the same holds in Gaussian design up to the log 
term). In this design if n is large enough, we know that taking p = 1/2 
will work - we do not want to take p too close to 1 since the bound on the 
performance of the estimator scales with 1/(1 — p). 

• Smallest threshold calibration v n The interpretation and theoretical 
value of v n is clear: it should be taken to be larger than the S quantile of 
the LHS quantity defined in Equation (5.5) divided by ||A|| 2 . Now since 
we do not have access to this quantile, we calibrate it in the experimental 
section of this paper as an empirical estimator of the asymptotic quantile 
described above (using Theorem 3.2). 

• Initialisation threshold To : The constant To needs to be taken as an 
upper bound on the Frobenius norm of 0. Note first that estimating from 
the data an upper bound on ||0||| is easy under Assumption 5.1, i.e. the 
quantity for n > 0 

-P1!(i + k), 

n 

overestimates ||0||| . In our simulations, we propose a slightly more refined 
heuristic upper bound and use the same To and T r in Subsection 4.1 and 
4.2. 
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To conclude on the practical tuning of the constants, we would like to empha¬ 
size that at least in practical situation, namely quantum tomography, we have 
enough information about both the design and the noise level to know that the 
above calibration will be working, provided that the target matrix is indeed 
low rank. So although the tuning of parameters is always a tricky issue for any 
algorithm, in at least this specific application, our algorithm can be used as it 
is. 

4. Experiments 

In this section we present some experiments, first some simulation for the con¬ 
struction of confidence intervals, and then a formal comparison of our thresh- 
olded estimator with other methods on specific quantum tomography problem, 
namely multiple ion tomography. 


/.I. Simulation results for the construction of entry-wise 
confidence intervals 

We performed experiments for low-rank matrix recovery, with matrix dimension 
d . We consider a Gaussian design where each X* j, ~ W(0,1) and are indepen¬ 
dent. We also consider a Gaussian uncorrelated noise e ~ A/"(0, /„). We consider 
a parameter 0 of rank k that is stochastically generated in an isotropic way as 

k 

Q = where, N t ~ Af(0, I d ). 

1=1 

We implemented our method choosing a data-driven heuristic for the choice 
of our parameters. We first set 

0 ° = 0 . 

We set for any r > 1 

a 2 r = \\Y - Jl/n, (4.1) 

i.e. the empirical risk, and 


v„(r) = d r 



(4.2) 


where qgo% is the 90% quantile of a Af( 0,1) random variable. v n (r) replaces here 
v n , and is a heuristic high probability bound on the error for each coordinate. 
We set 

Ti = B = o 1 +v n {l), (4.3) 

which is by construction higher than the Frobenius norm of 0 with high prob¬ 
ability, and 

T r = pT r _\ + v n (r), (4.4) 
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where we select p = 1/2 (we take 1/2 so that the decay is not too fast, but also 
so that 1/(1 — p) is not too large). 

We also use the heuristic stopping rule described in Equation (3.3), i.e. we 
iterate until 

T r < (1 + e) x -- v n {r) = 2.2 v n (r), 

1 - P 

for e = 0.1. 

We also construct, using the limiting distribution results provided in Theorem 
3.2, a confidence set for the all the entries of 0 that is such that for any entry 
(m, to'), we set the confidence interval 

O n — C m , 0m,m’ T 

where 

_ - f, 995% 

Cm.m' — (J r /— 5 

y/n 

where t 2 mm , = l/n x Ei< n ( X L,m') 2 ■ 

We provide several results, depending on the values of ( n,p , k), averaged over 
100 iterations of simulations. For these simulations, we present three kinds of 
results: 

• A first set of graphs (Figure 1) presents, for different values of p, k, and 
in function of the sample size n, the logarithm of the rescaled Frobenius 
risk of the estimate 0, i.e. 



lie - e|| 2 \ 
l|0|| 2 r 


• A second set of graphs (Figure 2) presents, for different values of p , k, and 
in function of the sample size n, the logarithm of the averaged diameter 
of the confidence intervals , i.e. 

lo S /p // Cm,™,/ 


• A last set of graphs (Figure 3) presents, for different values of p, fc, and 
in function of the sample size n, the averaged coverage probability of the 
confidence intervals C™ ,m , i.e. 

/ £ 1 {0m,m' € 


All these graphs also exhibit 95% confidence intervals (upper and lower 2.5% 
quantile values from 100 iterations) around their means (dotted lines in the 
graphs, the solid line being the mean). 
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p=64,k=3 p=100,k=3 



Fig 1 . Logarithm of the rescaled Frobenius risk of the estimate in function of n, for different 
values of p,k. The solid line is the average over 100 iterations, the dotted lines form 95% 
confidence intervals. 


p=64,k=3 p=100,k=3 



Fig 2 . Logarithm of the averaged rescaled length of the confidence intervals of the in function 
of n, for different values of p,k. The solid line is the average over 100 iterations, the dotted 
lines form 95% confidence intervals. 
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p=64,k=3 


p=100,k=3 
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p=64,k=10 


p=100,k=10 



1000 2000 3000 4000 5000 6000 



2000 3000 4000 5000 6000 7000 


Fig 3. Averaged coverage of the confidence intervals of the in function of n, for different 
values of p,k. The solid line is the average over 100 iterations the dotted lines form 95% 
confidence intervals. 


These figures exhibit different behaviours depending on the difficulty of the 
problems (increasing with p and more importantly with k ). The graphs in Fig¬ 
ure 1 for k = 3 (and p £ {64,100}) exhibit first a very fast decay of the risk, 
until some critical threshold n = ckd where c seems to be between 10 and 20. At 
this point, one can actually observe that the method recovers in most case the 
true rank k of the matrix, whereas it before recovered only a smaller rank ap¬ 
proximate of 0—with a too small n, it could not distinguish all the signal from 
the noise, and the fact that it gradually does for larger n explains the fast decay 
of the logarithm of the rescaled risk. After that, the curve has a kink and the 
decay becomes slower (the theory predicts that the logarithm of the rescaled risk 
should decrease with n as — log (n)). After this kink, all the k “rank directions” 
have been identified, and the logarithm of the rescaled risk starts decreasing 
slower, according to the theoretical rate of — log(n). The graphs in Figure 1 for 
k = 10 (and p £ {64,100}) exhibit mainly the first regime, since k is larger 
and the second regimes comes for larger values of n —empirically, we can ob¬ 
serve that the method recovers most of the time all k “directions” as soon as 
n = 4000 for p = 64, as soon as n = 6000 for p = 100. 

A parallel evolution can be observed in Figure 2, for the logarithm of the 
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average length of the confidence intervals. It is not at all surprising since this 
length is supposed to reflect the risk. The averaged coverage of these intervals 
in Figure 3 is in average larger than 87% in all cases, and in more than 95% of 
the cases, it is higher than 74% in all cases, which makes our method reliable. 

4- 2. Quantum tomography experiments 

4-2.1. Description of the ion tomography setting 

An important application that satisfies our assumptions and to which our method 
can be applied is quantum tomography , i.e. the estimation of quantum states. 

We consider the popular problem of multiple ion tomography, i.e. the problem 
of estimating the joint quantum state of m two-dimensional systems (qubits), 
as encountered in ion trap quantum tomography, see Guta et al. (2012); Gross 
et al. (2010); Butucea et al. (2015); Haeffner et al. (2005); Acharya et al. (2015) 
or Holevo (2001); Nielsen and Chuang (2000) for textbooks on this problem. 
Such a system’s quantum state can be represented by a positive semi-definite 
unit trace complex matrix 0 (a density matrix ) of dimension d := 2 m . 

For each individual qubit, the experimenter can measure one of the three 
Pauli observables described by the 2 by 2 Pauli matrices oq, 02 , 173 , where 


' 0 

1 ' 

2 

' 0 

—i 

3 

' i 

0 

1 

0 

, (7 = 

i 

0 

, cr = 

o 

_i 

-1 

i-H 


and each of these measurements may yield one of two outcomes, denoted by +1 
and —1 respectively. 

Therefore, a full experiment (i.e. an experiment that describes the measure¬ 
ment for each of the m qubits) is then defined by a setting S = (si,..., s m ) 
where each si G {oq, 02 , 03 } for l < m, which specifies which of the 3 Pauli ob¬ 
servables is measured for each qubit. For each fixed setting S, the measurement 
produces random outcomes O G {+1, — l} m , with expected probability 

PO,S = tr(Po,s©), 


where 


Po,S = 7Toi, Sl ® • • • ,®7r 0m>Sm , 


where n 0!jSi is the eigen projector of the the 2 by 2 Pauli matrix si associated to 
the eigen value o; (we remind that the 2 by 2 Pauli matrices have eigen values 
of either +1 or — 1 ). 

Now set 




1 0 
0 1 ’ 


(4.6) 


for the last 2x2 Pauli matrix such that (ot)ie{o,...,3} form an orthogonal basis of 
C 2x2 . Let O be the outcome of an experiment given a setting S = (si,..., s m ) 
(where each si G oq, < 72 , 03 ). Write S(E) for a setting where a subset E C 
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{1, ..., m} of the to matrices si of this setting have been replaced by erg, and 
O(E) for the outcome where the same subset E of the to elements o; have been 
replaced by 1. Since the only eigen value of cto is 1, the outcome of the measure¬ 
ment of a qubit by op is always 1. This implies in particular that the distribution 
of O(E) as described above, is the same as the distribution of the outcome of an 
experiment when the measurement setting is S(E). For this reason, measuring 
according to setting S actually gives information about all settings S(E) for any 
subset E of {1, ... ,to}. Thus instead of measuring all settings which are tensor 
products of 2 x 2 Pauli matrices cr 0 ,..., 03 , it is enough to measure all settings 
which are tensor products of 2 x 2 Pauli matrices {cri , cr 2 , CT3 }, as they provide 
information about corresponding settings that involve op. Therefore if one mea¬ 
sures all 3 m settings that correspond to the settings S = (si,...,s m ) where 
each si £ {cxi, cr 2 , cr 3 }, we have observations about all measurements direction, 
and our measurement setting is complete. 

Now we are interested in also dealing with situations where one does not 
want to observe all 3 m settings, and where one has only a number of settings 
N < 3 m . 

We consider a random measurement setting as in Flammia et al. (2012), 
i.e. each measurement setting S is chosen uniformly at random (each si is cho¬ 
sen uniformly at random among ay, ..., op). Let N be the total number of 
measurement setting chosen in this random way. For each chosen measurement 
setting S , we perform T repetition of the experiment and observe T independent 
outcomes. So for each chosen measurement setting S l with i < N, we observe 
T independent outcomes O t,s which are observations according to setting S l . 

4-2.2. Expression of the outcomes in the trace regression model 

It is often convenient to express the information contained by a measurement 
( S , O) in a way that involves tensor products of 2 x 2 Pauli matrices, rather than 
their spectral projections, see Flammia et al. (2012). Indeed, the set of matrices 
that are created by to tensor products of 2 x 2 Pauli matrices erg ,..., <73 is exactly 
the set of 2 m x 2 m = d x d Pauli matrices rescaled by \fd (introduced briefly in 
Remark 2.1) i.e. the dx d rescaled Pauli basis. Indeed let f(O) = o/, then 

one easily verifies that 

tr((si ® ® s m )0) 

oe{i-i} ra 1 

where Eq |5 is the expectation according to the outcome O when measurement 
S is chosen. In this sense, the measurement described by the dxd rescaled Pauli 
matrix P$ = s 1 (g> ■■■<§) s m can be measured by the parity of the spins f(0) 
that one gets when performing measurement S : in fact f(0)\S is a random 
variable such that its value is 1 with probability (tr(P$0) + l) /2 and —1 with 
probability 1 — (tr(Pg0) + l)/2 - and its expectation is tr(P§0) as noted. We 
write P(tr(Ps©)) for this distribution. 
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Let us go back to our experimental setup. As remarked before, each of the 
N quantum measurement settings according to setting S l (with i < N), where 
S l = (s\,..., s l m ) and where each s\ £ { 01 , 02 , 03 } for k < m provides us with 
d information in the sense of our trace regression model, i.e. we observe at each 
measurement S l , for each replication t <T and for all E C {1,... rn} 

Vs\e = f(O tA (E)) ~ H(ti(P §i{E) G)). 

In our trace regression model, we can average the observations f(O t,Si (E)) 
and we have the the following averaged observations for any i < s and for all 
E C { 1,... m} 


Y S\E - Y ^2 y s\E - + e s\E, 

1 t<T 

where E is the t th repetition (among T iterations) of the observation and 
where Iske is the averaged noise and is such that ^(ot.s^isAsb-E = 0, and such 
that e$i e is sub-Gaussian has a sum of bounded random variables and satisfies 
E( 0t ,s-))| S i exp(A e S i, E ) < exp(A 2 /T) for any A > 0. 

Now in order for our Assumption 2.1 to be satisfied for rank k matrices for a 
large enough number of settings N, we need to rescale our data. We set 

Y s *,e = Vd3-^Q m/ %, E = Vd3-^ 2 (^) m/ \T(P^ {E) Q)) + e s *,E, 

where \E\ is the cardinality of E and where eg. E is the rescaled noise and is 
such that E(ocs i )|si e S i ,E = 0, and such that es\ E is sub-Gaussian and satisfies 

( \ m 

|J /T) for any A > 0. It is a direct 
consequence from the results of Liu (2011) and from our Remark 2.1 that if 
N > 0{k 2 d\og{d)), then with high probability on the random draws of our 
settings we have that Assumption 2.1 is satisfied for rank k matrices. We can 
therefore apply our method and other low rank recovery methods such as trace 
regression methods to our rescaled data 



4-2.3. Experimental results 

We let m £ {4,5,6} so that d £ {16,32,64}, and let k £ {1,2} with a £ 
{2, 3,4,5} and consider N = akd measurement settings. These settings with 
small k and a were chosen since we are more interested in the truncated measure¬ 
ment setting (such that N < 3 m ). For the replication, we consider T £ {d, 10d}. 
Using the data (4.7), we estimate 0 by three methods—our proposed method 
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(IHT), the truncated maximum likelihood estimator (MLE) for the high dimen¬ 
sional multiple ion tomography model as described in Acharya et al. (2015), and 
nuclear norm penalization (NNP) method computed using a gradient descent 
method (e.g. Agarwal et ah, 2012). 

We use the same tuning parameters as in (4.1), (4.2), (4.3), and (4.4) and 
also we select p = 1/2 and the same stopping rule as we describe in Subsection 
4.1. For the stepsize of the gradient descent method used to compute the NNP 
estimator, we follow the recommendation in Subsection 3.1 in Agarwal et al. 
( 2012 ). 


Figure 4 and 5 present the result when the number T of replications is d and 
when the true rank is 1 or 2 respectively, and for four different values of d. We 
provide average values of squared Frobenius norm, operator norm, entrywise 
Loo norm, and Shatten 1 norm of 0 — 0 averaged over 100 iterations. Red dots, 
blue blank triangles, and black asterisks are average value of IHT, MLE, and 
NNP, respectively. Intuitively when a increases these risks will decrease. Our 
estimator shows almost comparable result to the MLE except in the case where 
d = 4 and a = 2. In this case, IHT estimates 0 by 0 a few times and pretty well 
for most cases so that on average the Frobenius norm is still large. Figures 6 
and 7 present the result when the replication is lOd and when the true rank is 1 
or 2 respectively, and for four different values of d. They show similar patterns 
as for the cases T = d, but we can see that IHT performs well especially for 
a € {4,5}. An interesting feature that all these pictures illustrate is that IHT 
performs the best relatively to other methods for a large number of replication 
T, and for difficult problems with high d and k (see in particular the plots in 
Figure 7 for large d). Note that our method is computationally more efficienct 
than the other two methods in the sense that when d = 64, a = 5, k = 2, IHT 
takes about 40 seconds while MLE (and even NNP) takes about 2.5 minutes for 
one iteration, on a regular laptop, i.e. it is about four times slower. 


5. Proofs 

5.1. Preliminaries 

For convenience in writing the proofs, we introduce the following quantities. 

We write, for integers q , q ', the vectorisation of a q x q’ matrix (where q' > 0) 
A by stacking the rows of 4 S as 

Vec(A) = (Ayr, Ay2 j • • ■ , A \ q ’, A2 p, . . . , A‘2 1 q' , . . . , Aq \, . . . , Aq q' )^. 

We write the Kronecker product between two matrices A and B as 4 0 B. 
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Fig 4. Squared Frobenius norm (L2), operator norm (Oper), entrywise Loo norm (Linf) and 
Shatten 1 norm (Shatten) ofQ — Q in function of a (and therefore in function og the number 
of settings N), for different values of d for replication T = d and k = 1 using the three 
methods described. 
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L2(d=16, k=2, T=d) 
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L2(d=32, k=2, T=d) 



L2(d=64, k=2, T=d) 








Fig 5. Squared Frobenius norm (L2), operator norm (Oper), entrywise Loo norm (Linf) and 
Shatt.en 1 norm (Shatten) of 0 — 0 in function of a, for different values of d for replication 
T = d and k = 2 using three methods. 
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L2(d=16, k=1, T=10d) 
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Fig 6. Squared Frobenius norm (L2), operator norm (Oper), entrywise Loo norm (Linf) and 
Shatten 1 norm (Shatten) of 0 — 0 in function of a, for different values of d for replication 
T = 1 Oil and k = 1 using three methods. 
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Fig 7. Squared Frobenius norm (L2), operator norm (Oper), entrywise Loo norm (Linf) and 
Shatten 1 norm (Shatten) of 0 — 0 in function of a, for different values of d for replication 
T = 1 Oil when k = 2 using three methods. 
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Consider the n x d 2 matrix X such that X i M = X l m m , for * < n and for 
M = (m — Vjd + ml < d 2 where m,m' = 1,... ,d: 


vec(A 1 ) T 


r X i,i 

X l\ 2 ■ 

• x\ td . 

■ X d,l ■ 

• x ld] 

vec(A 2 ) T 

= 

x h 

^i 2 2 ■ 

• X l,d ■ 

■ X h • 

■ x ld 

vec(A n ) T 


. ^r,i 

yn 

A l,2 * 

• ■ 

■ X h • 

yn 

• A d,d J 


Let IZ(k) be the set of vectorization of matrices in JA(k), that is, 7 Z(k) = 
{vec(A) : A £ Al(fc)}. HA £ Ad{k), then 7 Z(k) contains a vector a of dimension 
d 2 such that a m = for M = (m — 1 )d + m! £ {1,..., d 2 }. 

Assumption 2.1 can be rewritten as follows in this vectorized new notation. 

Assumption 5.1. Let K < d. For any k < 2K, it holds that 


sup -||AL4 || 2 

Aeiz(k) n 


Pill 


< Cn(k)\\A\\ 


2 

2 > 


where c n (k) > 0 . 

Assumption 5.1 actually implies the following lemma that bounds the scalar 
products rather than the norms. 

Lemma 5.1. If Assumption 5.1 holds, then for any k < K, we have that 
sup ~{XA,XB) - (A,B) 

A,B<ElZ(k ) 2 n 

< 2c„(2fc)||A|| 2 ||B|| 2 =: \\A\\ 2 \\B\\ 2 c n (k). (5.1) 

The proof of this lemma is in Subsection 5.3. 

5.2. Proof of Theorem 3.1 

Let fl be the set of vectors of and of norm 1 . 

1. Explicit writing of the quantities For any matrices U, V of dimension 
d x m with m > 1 , we set 

7 r (U, V) := U T (^ - tr((X i ) T 0 1 ’ -1 )]) V 

i<n 

= uT (~^2( xi ) TY l) v = ] u T v r v 

i<.n 

where Y{ = Y t - tr((X 1 ) T 0 r_1 ) and 7 r {U,V) £ R mxm . Also we set 'F r := 
0 — 0 r_1 and 


7 r {U,V) := U T (Q - 0 r-1 ^E = U T ^ r V £ R mxm 
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Note that Y[ = tr((X*) 2 T') + e* by linearity of the trace. 

Let u,v £ f 2 , then 7 r (u, v) is a scalar: 

7 r (u,v)= E U ™Vm'-J2 X ™’,rn Y i r 

m,m'<.d i<n 

= Y, U rnVrn'^Y X ™ , ,rn( tr (( XZ ) T '& r ) + e i) 

m,m'<.d i<n 

= E U mV m ^E X km( E 

m,m'<.d i<n k,k'<d 

Let U be the column vector of dimension d 2 such that U = vec(uv T ) = uig)v, 
that is, Um = u m v m / for M = (m — 1 )d + m!. Note that U £ 75.(1), and that 
\\ u \\l = E m , m '< d ( u ™ v m') 2 = E m ( u m) 2 E m '(v) 2 = L Consider the n x d 2 
matrix X such that X r M = X % m m , for i < n . Consider the column vector 
%l> r £ where ijf = vec(4' r ). Then we have 

7 r (u,v) = i Y WmE(^,m) T ( E X i,Ki> r K + £i) 

Me{l,...,d 2 } i<n K£{l,...,d 2 } 

= \u) T {X) T (xr + e) 

= ^/xU,Xr) + (XU,e)) : ( 5 . 2 ) 

where here (.,.) is the classic vectorial scalar product on M n . Also by definition 
of U and i/j r , we have 

Y{u,v) = (U,i/j r ). ( 5 . 3 ) 

The last equation implies that 


sup | 7 r (u,v) - 7 r (u,v)| 

u.vEQ 


< sup 

-(XU,X^ r ) - (U,-tp r ) 

+ sup 

~(XU,e) 

WG7?.(1),||W|| 2 = 1 

n 

WGK(1),||W|| 2 =1 

n 


2. Bound on the stochastic term We first bound the second term in (5.4) 
with the following lemma. 

Lemma 5.2. Assume that c„(l) < 1 (note that c n ( 1) < c n (K) for K > 1 ). It 
holds with probability larger than 1 — 5 that 


sup 
AeK( i) 


(XA,e) <C\\A\\ 2 x d 


,log(l/5) 


\\M\ 2 Vn, 


where C is an universal constant. 


(5.5) 
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Its proof is in Subsection 5.4. Lemma 5.2 implies that on an event of proba¬ 
bility larger than 1 — <5, we can bound the stochastic term in (5.4) 


sup 

u&n(i),\\u\\ 2 =i 


{XU,e) 


< v n . 


(5.6) 


Let £ be an event of probability larger than 1 — 5 where the above holds. 


3. Bound on the first term in (5.4) provided that the rank k r of 'L r 
is smaller than 2k Let us assume, only for this Paragraph 3. of the proof, 
that the rank k r of 4 ,r is smaller than 2k < K. By Lemma 5.1, we can apply 
Equation (5.1) (since k r < 2k < K ), and combining this with the fact that 
\\U \\2 = 1 , we have 


sup 

weTC(i),||w|| 2 =i 


(xu,xr)-(u,r) <c n {2k)wh- 


(5.7) 


By combining Equation (5.4), (5.6) and (5.7), and using \\ip r \\2 = ||^ ,r || 2 , we 
then have in the case that k r < 2k < K that on £ 


sup | 7 r (u,v) - 7 r (u,v)| < c n (2fc)||'E r || 2 +v n . (5.8) 

u,v(Ef2 


Since the previous result holds in the worst case of u,v € fl, we directly have 
on £ the corresponding entrywise result whenever k r < 2k 


sup ||7 r (^^)-7 r (^^)l|oo<c n (2A : )|!4'l2 + t-r l . (5.9) 

U,VeM n 

By definition, we know 7 r (U, V ) = U T 'i> r V and 7 r (U, V) = U T 'V r V = U T (Q — 
O r ~ 1 )V, which gives on £ whenever k r < 2k, 

sup ||f/ T (4' r + 0 r_1 — 0)1^1100 < c n (2fc)||\E' r '|| 2 + v n . (5.10) 

u,veMn 

Note also by definition of the thresholding process, the matrix 

d = (u r ) T {w + e r - 1 )y r - [(u r ) T (i’ r + 0 T - 1 )y r j Tr , 


is such that it is diagonal with all diagonal elements smaller than T r . Let 
U = (U r ) T U e Mn and V = ( V r ) T V € Mq. for U, V £ Mn- By elemen¬ 
tary calculations, we have 

U t DV= (Y.UkfO^kVkj).., 

k 

and so we have that 


||f/ T W||oo <sup| E 

^ k 

< T r sup ^ \Uk,iVkj | < T r sup ^/ilEibirEib = T r- 
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By definition of 0 r , we have 

u t dv = u T {i> r + &- l )v - u T u r i(u r ) T $ r + Q r - l )v r \ Tl .{v r ) T v 
= u T {i> r + e"- 1 )^ - u T e r v ., 

so this implies that 

sup \\U T (^ r + Q r ~ 1 )V-U T e r V\\ 00 <T r . 

U,VGMn 


Combining this with Equation (5.10), we obtain that on £ and whenever k r < 2k 

sup ||tf 3 V +1 V|| 0O = sup ||t/ T (0-0 r )y||oo 
U,VeMn U,VeMn 

<c n (2k)\\* r \\ 2 + v n + T r . (5.11) 

4. Induction We now stop assuming that the rank k r of v I' r is smaller than 
2k, and we consider the general case. 

We are going to prove by induction that on £, for any integer r > 1, we have 
that (i) the rank of 0 r_1 is smaller than k, and (ii) ||^ ,r ||2 < 2^/2kT r _\ := C r . 

For r = 1, since 0° = 0, then its rank is 0 and is therefore bounded by k and 
(i) is satisfied. Moreover, since T 0 = B > ||0|| 2 = ||0 — 0°||2 = ||’L 1 || 2 , then (ii) 
is satisfied as well. 

Now assume that (i) and (ii) hold on £ for a given r (as it holds for r = 1 not 
only on £ but on the entire probability space). By induction assumption (i), we 
have that on £ the rank of 0 r_1 is smaller than k, which implies that the rank 
of 4 ,r ' = 0 — 0 r_1 is smaller than k + k = 2k. 

Because we have that k r < 2k, Equation (5.11) applies and on £ 

sup ||£/ t 'I' , ' +1 E|| 00 < c n {2k)C r +v n +T r 
uyeMn 

< 2v / 2fcc rl (2fc)T r _i + v n + T r < 2T r , (5-12) 

by definition of T r and since 2\/2kc n (2k) < 4.\/l(c n (2K) (since 2k < A'). More¬ 
over, in the same way, we have that on £ (see Equation (5.10) since k r < 2k) 

sup || U T {i> r + 0 7 - 1 - 0)E||oo < c n (2k)C r +v n < T r . (5.13) 

uyeMn 

Let us now state the following lemma. 

Lemma 5.3. Let M £ R dlXd2 a ma t r i x (with d\ > d-z), with singular values 
(A j)j ordered in decreasing order (all positive). For any j < d 2 and for any 
collection of orthogonal vectors (w J )y'<y_i > we have 

A j < sup |u T Mv|. 

uSR d i ,veR d 2 : ||u 2 ||=l,||v||2 = l,uJ_(wJ') i /<j_i 
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Write (Apj for the singular values of 4' r + 0 r_1 ordered in decreasing order 
and all positive (and U r , V r for the diagonalising matrices). Let U*, V* be the 
matrices that diagonalise 0 and order its singular values in decreasing order on 
the diagonal and write (A *)j for its singular values (all positive). By Lemma 5.3, 
we know that, for any j < d , 

X r j< sup |u t ('T + 0 r - 1 )v|. 

u,vef2:u_L(£7 1 * 

Therefore, on £, by Equation (5.13), we know that for any j < d 
Xj < sup |u T 0v| +T r = X* + T r . 

u,veO:u_L(C/ i * )l<j-i 

So since all At that are smaller than T r are thresholded for constructing 0 r (we 
remind that the Xj are the diagonal elements of the diagonal matrix (U r ) T (’$’ r + 
Q r ~ 1 )V r , that is thresholded at level T r in the construction of 0 r ), it means 
that on £, the rank of 0 r is smaller than the rank of 0, i.e. it is smaller than k. 
This proves the first part of the induction (i). 

Now let U r ,V r be the matrices that diagonalise 'L r+1 , and let D r+1 = 
(t/ r ) 7 v]/ r+1 lA r . By (5.12), we have that on £ 

p r+1 ||oo = ||(tf r ) T * r+1 ^loo < 2 T r . 

Now since the rank of both 0 r and 0 are smaller than k on £, we know that the 
rank of T r+1 , and thus of D r+1 , is smaller than 2k. Therefore, we have since 
D r +i 

is diagonal and has therefore only 2k non-zeros elements that on £ 
\\D r+1 \\ 2 < 2V2kT r , 

which implies that on £, since the Frobenius norm is invariant by rotation 

||^ r+1 ||2 = ||^ + 1 || 2 < 2 V 2 fcT r . 

This concludes part (ii) of the induction and therefore, it concludes the induc¬ 
tion. 

5. Conclusion By the previous induction, we know that on £, we have 

sup ||t/ T ^ r+ V|| 00 < 2T r , (5.14) 

ee7^(fc),||0|| 2 <B,!7,yeAi^ 

and also that 

rank(0 r ) < k, 

and also that for any p > 0 

sup ||^ r+1 ||s, <2(2fc) 1/p T r . 

0e-R.(fc),||0|| 2 <B 

This concludes the proof since for r larger than c; log(n) with q a large 
enough constant, we have by definition of the sequence T r that 

T r < 2v n < 

V n 
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5.3. Proof of Lemma 5.1 

First, note that for A £ 7Z{k),B £ 7Z(k), we have 


\B\ 


xJ^,x- t B 


\ 1411s 


B 


= \(XA,XB)-(A,B)\. 


|A|| 2 ’"||B|| 2 / \||A|| 2 ’||B|| 2( 

Thus, without loss of generality, we consider ||A|| 2 = ||i ?||2 = 1- We know that 

{XAXB) = \\XA\\l + \\XB\\l-\\X(A-B)r^ 


and 


(A,B) = 


This gives 

— (XA, XB) - (A,B) 
n 

1 

< - 
“ 2 


-11*41!-HU 


+ ||B||!-||A-B| 


-WXBWl- \\B\\l 


i||*(A-B)||!-H-B||!|). 


By Assumption 5.1, using A — B £ 72.(2fc), we have for k < K , 

| {XA, XB) - (A, B) | < 1 (c„(fc) + c n (k) + 2c„(2fc)) < Tc n (2k) =: c n (k). 


This concludes the proof. 


5.4- Proof of Lemma 5.2 

Since e ~ Af(0, I n ), we have that 

— {XA, e) ~ A/ - (0, — 2 ||*A||!). 
n n z 

This implies that (using a Gaussian tail probability P(|X| > x) < e~ x / 2 for 
x > 0 when X ~ Af( 0,1)) with probability larger than 1 — <5 


-{XA, e) <-\\XA\\J\\og{l/8). 
n n V 2 


Since X satisfies the Assumption 5.1 with K > 1, we have that 


(5.15) 


sup 
Aen( 2 ) 


1||*4|!-||4|! < £,,(2)1141!, 

n 


which implies that for any A G 72.(2), we have 

14412 < V^||4| 2 v'i + c n (2). 


(5.16) 
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Equation (5.16) implies together with Equation (5.15) that for a matrix A £ 
75.(2), with probability larger than 1 — <5, 


~(XA,e) 

n 


< 


1 + Cn( 2) 


Plb 


log (1/5) 


=: PprP), 


(5.17) 


where v n (S) = \J 1+5 ^ log( n 1/<5) . 

To obtain the bound for the supremum of the quantity in (5.17) over all 
A £ {A, A £ 75(1), ||4|| 2 < 1} =: -4(1), we consider the approximating set 
B 0 C C ... whose property is described as follows. Let B 0 = {0}. Let, 
for any i £ N*, £>, be a 2 _ * covering set of .4(1). Here we use a classic result 
(Candes and Plan, 2011, Lemma 3.1), saying that the u-covering numbers of 
.4(1) is bounded by ( C/v) 2d+1 . 

Thus the cardinality of Bi is smaller than (C 2*) 2d+1 . Let £ be the event such 
that for all i, j £ N 2 and for each vector in u,v £ Bi x Bj, it holds that 


1 

n 


(X(u 


< u 


v|| 2 v n (Si,j), 


(5.18) 


where 8jj = 5(C"2 max ^ J )) _7<i , where C > 2 C is a large constant. By Equa¬ 
tion (5.17), and since u — v £ 75(2) we know that (5.18) holds with probability 
1 — 5ij for each i, j and for each vector u,ve6iX Bj. By a union bound, we 
have that 


no>i- E WoKj 

i,j€ N 2 

> 1 - 25(^(C'2 l ) 2d+1 ^(C'2 J ') 2d+1 (C , 2 max(ij) )- 7d ) 

i j<i 

> 1 - 2C' 4d+2 (C' , ) -7d 5(^^2 4di+2i i(2 l )- 7d ) 

i 

> 1 - 2C 4+2 (C' , r 7d 5(E* 2 ” 1 ) = 1 - 2 C 4d+2 (C')~ 7d S 

i 

>1 — 5, 


since C > 2 C. 

Let now A £ 75(4) such that ||4|| 2 = 1. It is possible to write A as 

OO 

A = E( u * - Ui_i), 

i—1 

where each u* belongs to Si, and where the (u i)i are such that ||— u^_ 1 1 |2 < 
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2 1 . We have on £ that 


sup 


i i i ^ 

~(XA,e) = -(^( 5 Z(ui ~ u *-i)),e) = - - U;_i),e) 

n n ' n L ' 

2—1 2=1 


< V -(Af(ui - Uj_i), e) 
z ' I n 
2=1 

OO 

< ^ ll U * - U,_i|| 2 Un((5i,i-l) 
2=1 

OO 

2=1 
OO 

<j2 2 ~ ic 


log ((c , 2 i - 1 ) M / < y) 


< Ct/d 


,log(l/<5) 


This concludes the proof. 


5.5. Proof of Lemma 5.3 

Let (u fe )fc £ R dl , (v fc )fe £ R d2 be the singular vectors of M, i.e. Mv fc = AfcU fc . 
Let E = span {{v)k<j)- The dimension of E is j. Let now F be the vectorial 
sub-space that is orthogonal to span((w J 1 ). Its dimension is d 2 — j + 1. 

Since dim(Ti) + dim(F) = d 2 + 1, there is at lest one unitary vector in Ef]F. 
Let h £ R d2 be this vector, since it is in E , it can be written as 

h = ^2 h kV k 

k<j 


where for k = 1,..., j, we have hj. > 0 and = 1. Therefore, we have that 

Mh = J2^kh k u k . 

k<j 


Consider g = Mh/||Mh|| 2 . So we have that 

'« TMh '= !( rr ,} =» Mh||i 


= X l h l > Jmjn(A2) xJ2 h l = X j, 


since the (u fc ) fe are orthonormal, and since the singular values are positive and 
ordered in decreasing order. This concludes the proof. 
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5.6. Proof of Theorem 3.3 


From the proof of Theorem 3.1, Equation (5.14), we know that the estimator 
0 r satisfies 

||B f - e|| s <T,< 1.1—!—«„ = O(^djn), 

1 - P 

i.e. the desired result in operator norm and from which the results on the rank 
and in the other Schatten norm follow because of the thresholding. Also since 
the sequence T r is an arithmetico-geometrical sequence converging to jz~fV n , of 
arithmetic term v n and of geometric term p, it is clear that f is such that 


P^To > 


0.1 

l-P 


V n , 


i.e. 


log (l0(l - p)T 0 /{v n )) 
f < 1 H-- —-— , N -— < 0(log(n)). 


log(l/p) 


This concludes the proof. 

5.1. Proof of Theorem 3.2 

By definition, we have that 


1 

Vn(0 - 0) = V^(B r - 0) + ^ tr((X l ) T (0 - 6 r )) + ef) 

n 

= Vn(Q r - 0) + -= ^(X i ) r (tr((X l ) T (0 - 0^))) 

vn i=l 

+ 4x>‘) T e < 


i -1 


= X + Z. 

Let to, to' < d and let u m be the vector with all element equal to 0 except the 
m th entry which is equal to 1, and we consider that U m ' rn = vec(u m (u m ) T ). 
We have by definition and using representations (5.2) and (5.3) that 


1 


A m,m> = yfrl ~{xu m ’ m ,xr +1 ) - 

\n 


and 


Z, 


1 n 

T.,m' /— / ,(X )m,m'*A 

V n T-f 

t—1 


Note that given X 1 , i = 1,..., n, 


ri \2 
)m,m' 


Var(Z m , m ,) = - V(X l 

n z ' 

i=l 
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and 


Co v(Z jJ/ ,Z ltl ,) = -Y / (X i ) j ,AX%'. 


i =1 


The following Lemma is a concentration inequality that holds in Gaussian 
design. 

Lemma 5.4. Assume that the design is Gaussian. Let A £ A4. We have that 
with probability larger than 1 — <5 (on the design) 


log(l/5) log(l/<5) 


) =: \\Mlvn(6). 


-Mi-Mil! <c\\A\\l 

Proof. Let A £ lZ(k). We have 

^=XA~Ar(0,±\\A\\ll n ), 

where I n is the n x n identity matrix. This implies that 

hxA\\*~h\ A \\lxl=/\A\\tY / xl 

n n n z ' 

i<.n 

where xj is the chi square distribution with j degrees of freedom. By Bernstein’s 
inequality, we thus have (since the xl distribution is sub-Gaussian) that, with 
probability larger than 1 — <5, 


i<.r, 




where C is an universal constant. This implies that, with probability larger than 
1 - <5, 


l\\XA\\l-\\A\\l <C\\A\\H 


This concludes the proof. 


log(l/(5) log(l/<5), 


=: M||^„(<5). 


O 


Combining Lemma 5.4 with Pythagoras’s theorem as in the proof of Lemma 5.1, 
we have that for any A, B £ A4, with probability larger than 1 — <5, 

~{XA, XB) - (A, B) < 45 n (<S/3)M|| 2 ||B|| 2 . 
n 

By a union bound, this implies that with probability larger than 1 — 5, 

Halloo == sup |A m?rn /| 

m<d,m' <d 


-(XU™’ 171 ', Xif r+1 ) - lu m ’ m ' ,if r+1 ) 
n 

< Vn\W +1 h ( 4 n„(( 5 /( 3 d 2 ))) 


= \fn sup 

m<.d.m' <.d 


< C\fn 


kdlog(l/5) /log(d/<5) 
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where C is a universal constant. This concludes the proof (in remarking that 
the above quantity is arbitrarily small when kd\og{d ) = o(n)). 
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Supplementary Material 

Appendix A: Results for the sparse linear regression model 

The method that we proposed and studied in the low rank matrix recovery 
setting can be adapted and simplified to accommodate another setting : the 
sparse linear regression setting. We explain how to construct an estimator based 
on IHT, and prove that the estimator is efficient in L 2 and Loo norm, and provide 
the limiting distribution of a simple modification of our estimate. 


A.1. Setup 


We let B(k) := B 0 ( 0, k) be the “Iq(R p ) ball” of radius k, i.e. B{k) is the subset 
of the vectors u £ R p such that u has less than k non-zero coordinates. 
Consider the linear model 

Y = X6 + e, 

where A is a n x p matrix, the signal vector 6 £ R p is fc-sparse ( 0 £ B(k)), 
and e £ R n is an i.i.d. vector of Gaussian white noise, i.e. e ~ Af(0,I n ) (as in 
the matrix regression, we do not need the Gaussian assumption and our results 
hold with sub-Gaussian independent noise), and p^> n. We denote the sample 
covariance matrix by E = -X T X £ R pxp . 

Assumption A.l. Let I\ < p. We assume that there exists a matrix V such 
that for any k < 2K 1 there exists a constant r k > 0 such that 


sup 

u£B(k) 


\\vtu - m||oo 

IMloo 


< r k . 


Remark A.l. Suppose X is from a distribution whose covariance matrix is 
E £ R pxp . Let the minimum eigenvalue a m i„(S) > C m i n > 0 and the maximum 
eigenvalue cr max {Tj ) < C max < 00 and maXi £ u]T,a < 1. Assume that AE -1 / 2 
has independent sub-Gaussian rows with zero mean and sub-Gaussian norm 
IIS-^ArlU = n. Then from the paper (Javanmard and Montanari, 2014), 
for n > C m i Tl logp/(4e 2 C' ma2; /c 4 ), with probability larger than 1 — 2 p~° 2 with 
C 2 = C m i n /{2de 2 K, A C rnax ), there exists a computationally feasible V such that 


||VS-/||oo < 



(A.l) 


holds. In this case, we can take r k = k 



A.2. Method 

This algorithm takes again three parameters : 5, K and B. We have the same 
interpretation for 8 as in the matrix regression setting, K is an upper bound on 
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two times the sparsity of 9 (again, it does not need to be tight as long as tk is 
small enough), and B is a loose bound on the norm of 9. 

First, we set the initial values for the estimator 9° and the threshold T 0 such 
that 

0° = 0, T 0 = B. 

Then we update thresholds in each iteration r € N*, by 

T r = 2r K T r -i + v, 

where v = 2/M— / 6 ' 1 where M = maxdiag(V r SV rT ). Recall that the pseudo 

inverse V of E and are taken from Assumption A.l. 

Set now recursively, 

a r = l-VX T (Y- , 

n 

and 

9 r = 6 r ~ 1 +a r . (A.2) 

This procedure provides a sequence of estimates, and as we will prove in the 
next subsection, this sequence is with high probability close to the true 9 as 
soon as r is of order log(n) (see Theorems A.l and A.2). 

Remark A. 2 (Iterative hard thresholding (IHT)). The proposed method mod¬ 
ifies iterative algorithms (see e.g. Blumensath and Davies, 2009; Needell and 
Tropp, 2009). The usual (normalised) IHT algorithm updates the estimate using 
9 r = Pff9 r ~ l + w r ~ 1 X T (Y — X0 r_1 )) where Pk is a hard thresholding operator 
that keeps the largest k elements of a vector and w r ~ 1 gKisa step size that can 
have the interpretation of a Gradient step when it is much smaller than 1. The 
difference is in the thresholding; we update thresholds while they pick the largest 
k values after adjusting the added parts. Most importantly, previous works on 
this estimator only considered the case of a deterministic (small) noise, so their 
analysis is not applicable in our model where the noise is stochastic. 

A.3. Main results 

We now provide a theorem that guarantees that the estimate 9 r in (A.2) has an 
optimal Too risk after 0(log(n)) iterations. 

Theorem A.l. Assume that Assumption A.l is satisfied and that 2 rx < 1. 
Let r = log(n)/log(l/(2rif)) ss 0(log(n)). We have that with probability larger 
than 1 — S, for any k < Kj2, 


sup 

e&B(k).\\8\\ 


1 9-9 r 


< Q 


,<B 


M log (p/6) 


where Co = (B + l J ‘ 2rk ) and M = maxdiag(RER T ). 
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Remark A.3. If the design is obtained as in Remark A.l, then as long as 
K = o{^n/ log(p)), with high probability the assumptions of Theorem A.l will 
hold. 

Theorem A.l provides two side results— L 2 convergence rates and asymptotic 
normality. The first corollary is immediately obtained from the fact that for any 
0 e B(k), ||0|| 2 < \/fc||0||oo- 

Corollary 1. Suppose that the same assumptions and notation used in Theorem 
A.l hold. We have that with probability larger than 1 — 6, for any k < K/2 


sup \\9 r - 9\\ 2 < C 0 
eeB(fe),||0|| oo <s 


kM\og(p/8) 


To prove asymptotic normality, we slightly modify the estimator defined in 
Theorem A.l. This is similar to the de-sparsified LASSO by van de Geer et al. 
(2014) in the sense that we also use a de-sparsified version of our estimator. Con¬ 
sider the estimator 9 r of Theorem A.l (with the same r = log(n)/log(l/(2r^-))) 
and V in Assumption A.l, and define 

9 :=9 r + -VX T (Y - X9 r ). (A.3) 

n 

Theorem A.2. Suppose that the same assumptions and notation used in The¬ 
orem A.l hold. Then, writing Z := -^VX T e and A := \/n(I — Vff)(9 r — 9), 
we have 

y/n{6 - 9) = A + Z (A.4) 

where Z\X ~ J\f(0, ^VT,V T ). If tk = o(l) (e.g. for designs as in Remark A.l, 
we have Vk = 0(K-J (log p)/n) so it suffices that I\ = o(y/n/ log p)) then we 
also have 

l|A||oo = Op(l). 

The estimate we provide has similar properties as in Javanmard and Monta- 
nari (2014); van de Geer et al. (2014). 


A.4- Proof of Theorem A.l 

We have 

\\(9-9 r - 1 )~-VX T (Y~X9 r - 1 )\\ 00 = \\(9-9 r - 1 )~-VX T (X9 + e~X9 r - 1 )\\ 00 
n n 

= \\(9-e r - 1 )-v£(9-d r -')--vx T e\\ 00 

n 

< \\{9~9 r - 1 )-V£{9-9 r - 1 )\\ 00 + \\-VX T e\\ 00 . (A.5) 

n 

Since e ~ A/*(0, / n ), we know that 

-VX T e~N{0, 1 VtV T ). 
n n 
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By an union bound (with Hoeffding’s inequality) we know that with proba¬ 
bility larger than 1 — <5 


1^11 


<2 \IM 1 °^ P/S) =v. 


(A.6) 


Let £ be the event of probability 1 — 5 where the previous equation is satisfied. 
We have by Assumption A. 1 if 9 — 9 r ~ 1 is k sparse 


\\Vt(9 - 9 r ~ x ) - (i 9 - d r - 1 )||oo < r fc ||0 - 0 r - 1 ||oo. (A.7) 


Combining Equations (A.5), (A.6) and (A.7) implies that on £, if if 9 — 9 r 1 
is k sparse 

\\(9-6 r - 1 )--VX T (Y-X9 r - 1 )\\ oo <r k \\0-e r - 1 \\ oo + v. (A.8) 
n 

We are going to prove by induction that on £, 

\\9-9 r \\ 00 <2T r , 

and that the support of 9 r is included in the support of 9. 


1. Initialisation: Consider r = 0. Since 9° = 0, its support is included in the 
support of 9. Moreover, by definition of B , we have that 

\\9-9°\\ oo = \\9\\ oo <B<2T 0 . 

This concludes the proof for r = 0. 


2. Induction step: Assume that for a given r, on £ 

\\0-6 r \\ oo = \\(9-e r - 1 )-a r \\ oo <2T r . 

We moreover assume that the support of 9 r is contained in the support of 9, 
which implies that it is k sparse. 

By Equation (A.8) we know that on £, since 9 — 9 r is k sparse 

|| -VX t {Y - X9 r ) - (9 - 0 r )|| oo < r k \\9 - <? r ||oo + v 
n 

< 2 r k T r + v < T r+ i, (A.9) 

since T r+ 1 = 2rjfT r + v. Since a r+1 = [^VX T (Y — X9 r )_| T ^ we have that 

on £, by Equation (A.9), all the coordinates j of a r+1 such that (9 — 9 r )j = 0 
are set to 0. This implies that the support of a r+1 (and thus the support of 
Qr+i _ gr _|_ j nc i uc t ec i j n the support of 9 on £. Therefore, a r+1 is 

fc—sparse on Also, still since a r+1 = VX T (Y — X9 r ) J T , we have that 

\\-VX t (Y - X9 r ) - < T r+1 , 

n 
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and this implies together with Equation (A.9) that on £, we have 

\\e - 0 r+1 lloo = ||(0 - e r ) - a^lloo < 2 T r+1 . 

This concludes the proof for r + 1. 

The induction is complete, and we have that the previous equation holds for 
all r > 1. It is equivalent to the fact that on £ (and thus with probability larger 
than 1 — (5), for all r > 1 

||0-0 r |U < 2T r , (A.10) 

and the support of 9 r is included in the support of 9. 


3. Study of the sequence T r The sequence T r is such that 
T r = 2 rjfT r _! + v and T 0 = B. 

A simple induction on this geometric sequence provides that 


T = 

-L r 


1 

1 -2r K L 


(2r K ) r ((l - 2r K )B - v) + v < (2 r K ) r B + v/{l 


2 r K ). 


4. Conclusion Let r = — log(n)/log(2rx) ss 0(log(n)), since 2 tk < 1 and is 
a constant. We have by Equation (A.10) and by the recursion on T r that on £ 


||0 - 9 r 


B 


1-2 r K 


< B 


1-2 r K 


Af log (p/S) 


(A.ll) 


A.5. Proof of Theorem A.2 


By definition, 


\fn(9 — 6) = y/n 



9) + -VX T (X9 - X9 r ) + -VX T e 
n n 



9) - Vt{9 r - 0)) 


+ /=VX T e = A + Z. 

y/n 


Given X, we know that Z is a linear function of the Gaussian vector e, thus 

Z\X ~ N(0,VtV T ). 

Now we prove the bound for A. Note that using (A.l) and = 0(ky/ (log p)/n), 
for a sufficiently large n, we have a constant C > 0 such that 

IIAIloo = y/n\\(I - vt)(9 r - 0)1100 < Ck^p\\9 r - 0||oo. 

Then using the result from Theorem A.l, with probability at least 1 — S, we 
have as long as k = o(y/n/ logp) 


||A||oo < CC 0 k 


M log (p/d) 


/n 


0, 


as n —> oo. 
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